diabetes_dataset
## # A tibble: 100,000 × 9
## gender age hypertension heart_disease smoking_history bmi HbA1c_level
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Female 80 0 1 never 25.2 6.6
## 2 Female 54 0 0 No Info 27.3 6.6
## 3 Male 28 0 0 never 27.3 5.7
## 4 Female 36 0 0 current 23.4 5
## 5 Male 76 1 1 current 20.1 4.8
## 6 Female 20 0 0 never 27.3 6.6
## 7 Female 44 0 0 never 19.3 6.5
## 8 Female 79 0 0 No Info 23.9 5.7
## 9 Male 42 0 0 never 33.6 4.8
## 10 Female 32 0 0 never 27.3 5
## # ℹ 99,990 more rows
## # ℹ 2 more variables: blood_glucose_level <dbl>, diabetes <dbl>
# male dataset
male_data <- diabetes_dataset %>% filter(gender == "Male")
# female dataset
female_data <- diabetes_dataset %>% filter(gender == "Female")
female_data
## # A tibble: 58,552 × 9
## gender age hypertension heart_disease smoking_history bmi HbA1c_level
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Female 80 0 1 never 25.2 6.6
## 2 Female 54 0 0 No Info 27.3 6.6
## 3 Female 36 0 0 current 23.4 5
## 4 Female 20 0 0 never 27.3 6.6
## 5 Female 44 0 0 never 19.3 6.5
## 6 Female 79 0 0 No Info 23.9 5.7
## 7 Female 32 0 0 never 27.3 5
## 8 Female 53 0 0 never 27.3 6.1
## 9 Female 54 0 0 former 54.7 6
## 10 Female 78 0 0 former 36.0 5
## # ℹ 58,542 more rows
## # ℹ 2 more variables: blood_glucose_level <dbl>, diabetes <dbl>
# males and females in the original dataset with an HbA1c of 5.7 or lower
# (note: this includes 5.7 itself, which the HbA1c_category table labels as prediabetes)
female_data %>% filter(HbA1c_level <= 5.7) %>% tally()
## # A tibble: 1 × 1
## n
## <int>
## 1 27397
male_data %>% filter(HbA1c_level <= 5.7) %>% tally()
## # A tibble: 1 × 1
## n
## <int>
## 1 18865
# count of people (male and female) with both heart disease and diabetes
diabetes_dataset %>% filter(diabetes == 1, heart_disease == 1)
## # A tibble: 1,267 × 9
## gender age hypertension heart_disease smoking_history bmi HbA1c_level
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Male 67 0 1 not current 27.3 6.5
## 2 Male 57 1 1 not current 27.8 6.6
## 3 Male 80 0 1 former 24.4 7.5
## 4 Male 75 0 1 not current 28.1 7.5
## 5 Male 69 0 1 former 24.1 6.8
## 6 Female 59 0 1 never 60.3 8.8
## 7 Male 80 0 1 former 33.0 6
## 8 Female 62 1 1 former 44.2 8.2
## 9 Female 62 1 1 never 43.2 8.8
## 10 Female 76 0 1 former 25.7 9
## # ℹ 1,257 more rows
## # ℹ 2 more variables: blood_glucose_level <dbl>, diabetes <dbl>
diabetes_dataset %>% filter(diabetes == 1, heart_disease == 1) %>% tally()
## # A tibble: 1 × 1
## n
## <int>
## 1 1267
| gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|
| Female | 80 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| Female | 54 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| Male | 28 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| Female | 36 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| Male | 76 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |
| Female | 20 | 0 | 0 | never | 27.32 | 6.6 | 85 | 0 |
| gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes | HbA1c_category |
|---|---|---|---|---|---|---|---|---|---|
| Female | 80 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 | Diabetes ≥ 6.5% |
| Female | 54 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 | Diabetes ≥ 6.5% |
| Male | 28 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 | Prediabetes 5.7% - 6.4% |
| Female | 36 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 | Normal < 5.7% |
| Male | 76 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 | Normal < 5.7% |
| Female | 20 | 0 | 0 | never | 27.32 | 6.6 | 85 | 0 | Diabetes ≥ 6.5% |
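The code that produced the HbA1c_category column in the second table is not shown. One way it could be derived is sketched below, assuming the cut-offs implied by the labels (below 5.7, 5.7 to 6.4, 6.5 and above). `toy_data` and `categorised` are made-up names, and the toy values are copied from the first rows of the table.

```r
library(dplyr)

# Hypothetical stand-in for diabetes_dataset (first six rows of the table above)
toy_data <- tibble(
  gender      = c("Female", "Female", "Male", "Female", "Male", "Female"),
  HbA1c_level = c(6.6, 6.6, 5.7, 5.0, 4.8, 6.6)
)

# Assumed cut-offs, read off the category labels; missing values stay NA
categorised <- toy_data %>%
  mutate(
    HbA1c_category = case_when(
      HbA1c_level < 5.7  ~ "Normal < 5.7%",
      HbA1c_level < 6.5  ~ "Prediabetes 5.7% - 6.4%",
      HbA1c_level >= 6.5 ~ "Diabetes ≥ 6.5%"
    )
  )

categorised
```

Because `case_when()` checks conditions in order, a value of exactly 5.7 skips the first branch and lands in the prediabetes band, matching the third row of the table.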
Similar Prevalence of Prediabetes – The proportion of individuals categorized as having prediabetes (HbA1c 5.7% - 6.4%) is almost identical between males (41.3%) and females (41.4%). This suggests that prediabetes affects both genders at nearly the same rate.
Females Have a Slightly Higher Proportion of Normal Blood Sugar Levels – More females (38.4%) fall into the normal blood sugar category (<5.7%) compared to males (37.1%). This may indicate some slight protective factors or lifestyle differences in this group.
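A slightly larger share of males falls in the diabetes category (the remainder after the two categories above, roughly 21.6% of males versus 20.2% of females), so there could be gender-related risk factors worth exploring, such as diet, activity levels, or genetic predisposition.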
Overall, blood sugar regulation patterns appear fairly balanced between genders, but small differences suggest potential areas for further investigation.
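The percentages quoted above appear to be within-gender shares, meaning each gender's categories sum to 100%. Here is a minimal sketch of that calculation on made-up data, assuming a category column already exists. `toy_data` and `category_share` are illustrative names, and the numbers are not from the real dataset.

```r
library(dplyr)

# Hypothetical toy data; the real analysis would use the categorised diabetes_dataset
toy_data <- tibble(
  gender = c("Female", "Female", "Female", "Female",
             "Male", "Male", "Male", "Male"),
  HbA1c_category = c("Normal", "Normal", "Prediabetes", "Diabetes",
                     "Normal", "Prediabetes", "Diabetes", "Diabetes")
)

# Count each gender/category pair, then convert counts to shares within gender
category_share <- toy_data %>%
  count(gender, HbA1c_category) %>%
  group_by(gender) %>%
  mutate(percent = 100 * n / sum(n)) %>%
  ungroup()

category_share
```

Dividing by the per-gender total rather than the overall total matters here, since the dataset has many more females than males and raw counts would not be comparable.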
This plot shows the distribution of BMI values by hypertension status. A violin plot is well suited to visualizing the distribution and density of BMI across hypertension categories.
Shape and width: The width of each “violin” represents the density of BMI values at different levels. Wider sections mean more individuals have that BMI, while narrower sections indicate fewer people at those values.
Comparison of distributions: The blue violin represents people without hypertension (hypertension = 0), while the red violin represents those with hypertension (hypertension = 1). By comparing them, you can see how BMI differs between these groups.
The horizontal line around 25 BMI: This marks the median BMI for each group. Since both violins have a horizontal line in roughly the same position, it suggests that the median BMI is around 25 for both hypertensive and non-hypertensive individuals.
Density trends: If the violins have different thicknesses in certain BMI ranges, it tells you which BMI values are more or less common in each group. People with hypertension seem to have a higher BMI overall, but both groups share a similar median.
The distribution shape is different. For example, if one violin is wider at higher BMI values, it suggests that higher BMI values are more common in that group.
Outliers or extreme values might appear as small bulges or extended tails at the ends of the violins, showing individuals with very high or low BMI.
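A plot along these lines could be built with ggplot2. The sketch below uses simulated BMI values, not the real data, and marks each group's median with a point because the original plot's code is not shown. `toy_data`, `p`, and `bmi_medians` are illustrative names, and the blue/red fill follows the colors described above.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
# Simulated BMI values, purely illustrative
toy_data <- tibble(
  hypertension = rep(c(0, 1), each = 200),
  bmi = c(rnorm(200, mean = 26, sd = 5), rnorm(200, mean = 29, sd = 5))
)

# One violin per hypertension status, with the group median marked as a point
p <- ggplot(toy_data, aes(x = factor(hypertension), y = bmi,
                          fill = factor(hypertension))) +
  geom_violin() +
  stat_summary(fun = median, geom = "point") +
  scale_fill_manual(values = c("0" = "blue", "1" = "red")) +
  labs(x = "Hypertension (0 = no, 1 = yes)", y = "BMI", fill = "Hypertension")

# The same medians as numbers, for checking what the plot shows
bmi_medians <- toy_data %>%
  group_by(hypertension) %>%
  summarise(median_bmi = median(bmi))

p
```

Comparing `bmi_medians` with the plot is a quick way to confirm whether the two groups really share a similar median.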
To read the chart, look at the “thickness” of each distribution. Where the plot is wider, more people fall within that BMI range. For people without hypertension, the plot is widest at lower BMI values. The pattern is mirrored for people with hypertension, where the plot is widest at higher BMI values.
Here I’ll leave extra info for you guys regarding the gender column of the original dataset:
diabetes_dataset %>% filter(gender == 'Female') %>% tally() # 58,552; there are 17,122 more females than males in this dataset
## # A tibble: 1 × 1
## n
## <int>
## 1 58552
diabetes_dataset %>% filter(gender == 'Male') %>% tally # 41,430
## # A tibble: 1 × 1
## n
## <int>
## 1 41430
diabetes_dataset %>% filter(gender == 'Other') %>% tally # 18
## # A tibble: 1 × 1
## n
## <int>
## 1 18
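The three filter-and-tally calls above can also be collapsed into a single `count()` call. Here is a small sketch on made-up data (`toy_data` and `gender_counts` are illustrative names):

```r
library(dplyr)

# Hypothetical toy data standing in for diabetes_dataset
toy_data <- tibble(
  gender = c("Female", "Male", "Female", "Other", "Female", "Male")
)

# One row per gender, largest group first
gender_counts <- toy_data %>% count(gender, sort = TRUE)

gender_counts
```

This returns every gender value present in the column, so a category like 'Other' shows up without needing to know about it in advance.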
One of the more interesting bits of data is that some individuals have an HbA1c above 6.5 yet are not recorded as diabetic (diabetes = 0). The first row of the dataset is one example, with an HbA1c of 6.6.
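To see how many such rows there are, one option is a simple filter. The sketch below runs on made-up data, with `toy_data` and `high_a1c_not_diabetic` as illustrative names. The 6.5 threshold is taken from the sentence above.

```r
library(dplyr)

# Hypothetical toy data; column names match diabetes_dataset
toy_data <- tibble(
  HbA1c_level = c(6.6, 6.6, 5.7, 7.5, 5.0),
  diabetes    = c(0, 0, 0, 1, 0)
)

# Rows with an HbA1c above 6.5 that are not flagged as diabetic
high_a1c_not_diabetic <- toy_data %>%
  filter(HbA1c_level > 6.5, diabetes == 0)

nrow(high_a1c_not_diabetic)
```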
The smoking data has 6 unique values.
The number of people in each category is shown in the output below.
A sizable number of people (35,816) fall into the No Info category.
The dataset has 100,000 people in total. To help clean up the data, we can filter the ‘No Info’ people out, which leaves 64,184.
# Figure out the unique categories of smoking history
unique(diabetes_dataset$smoking_history)
## [1] "never" "No Info" "current" "former" "ever"
## [6] "not current"
# Count the number of people in each category
diabetes_dataset %>% group_by(smoking_history) %>% summarise(total_people = n())
## # A tibble: 6 × 2
## smoking_history total_people
## <chr> <int>
## 1 No Info 35816
## 2 current 9286
## 3 ever 4004
## 4 former 9352
## 5 never 35095
## 6 not current 6447
smoking_diabetes_dataset <- diabetes_dataset %>%
  filter(smoking_history != 'No Info') %>%
  group_by(smoking_history, diabetes) %>%
  summarise(total = n())
## `summarise()` has grouped output by 'smoking_history'. You can override using
## the `.groups` argument.
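The message above is informational: `summarise()` drops only the last grouping variable by default. As a sketch on made-up data (`toy_data` and `smoking_summary` are illustrative names), passing `.groups = "drop"` silences it, and the counts can then be turned into diabetes shares within each smoking category:

```r
library(dplyr)

# Hypothetical toy data; column names match diabetes_dataset
toy_data <- tibble(
  smoking_history = c("never", "never", "No Info", "current",
                      "current", "former", "never"),
  diabetes        = c(0, 1, 0, 1, 0, 0, 0)
)

smoking_summary <- toy_data %>%
  filter(smoking_history != "No Info") %>%
  group_by(smoking_history, diabetes) %>%
  summarise(total = n(), .groups = "drop") %>%  # explicit, so no message
  group_by(smoking_history) %>%
  mutate(percent = 100 * total / sum(total)) %>%  # share within each category
  ungroup()

smoking_summary
```

Shares are easier to compare than raw totals here because the smoking categories differ a lot in size.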